Did AI get more negative recently?

In this paper, we classify scientific articles in the domain of natural language processing (NLP) and machine learning (ML), as core subfields of artificial intelligence (AI), into whether (i) they extend the current state-of-the-art by the introduction of novel techniques which beat existing models or whether (ii) they mainly criticize the existing state-of-the-art, i.e. that it is deficient with respect to some property (e.g. wrong evaluation, wrong datasets, misleading task specification). We refer to contributions under (i) as having a ‘positive stance’ and contributions under (ii) as having a ‘negative stance’ (to related work). We annotate over 1.5 k papers from NLP and ML to train a SciBERT-based model to automatically predict the stance of a paper based on its title and abstract. We then analyse large-scale trends on over 41 k papers from the last approximately 35 years in NLP and ML, finding that papers have become substantially more positive over time, but negative papers also got more negative and we observe considerably more negative papers in recent years. Negative papers are also more influential in terms of citations they receive.


Introduction
Deep learning has revolutionized machine learning (ML) and natural language processing (NLP) in the last decade. In particular, deep learning has led to unprecedented performance gains on a large number of NLP and ML tasks, including machine translation [1], image classification [2], natural language understanding [3] and text generation [4].
On the other hand, there has seemingly also been a recent surge of papers highlighting limitations of (deep learning) approaches, including claims about models exploiting dataset biases [5], flawed evaluation [6] and general 'troubling trends' in ML practice [7].
Indeed, from a historic perspective, deep learning (formerly known under the name of 'artificial neural networks') is a prime exemplar of a technology that has received very mixed assessments over time, ranging from initial hype to negative and positive appraisal in repeating cycles [8,9].
Motivated especially by (recent) observations of negative assessment of individual papers regarding the existing literature and its claims [10], we define a new NLP task of determining the stance of a paper (with respect to its related work). 1 We take a prototypical paper of negative stance to be one that concludes that related work is basically flawed, mistaken and potentially based on false assumptions (cf. table 1); negative papers could also be referred to as critique papers, which disrupt current knowledge. By contrast, a prototypical paper of positive stance is one that generally accepts the premises of related work (even though it may identify specific, minor issues and weaknesses), extends it and sets a new state-of-the-art. Positive papers could also be referred to as extension papers, which build upon and extend current knowledge. 2 (Any particular paper may mix positive and negative elements, so we treat the task as a continuous regression problem.) We consider this task important for analysing trends in science and its evolution [14], which could potentially anticipate pessimistic future developments (the end of the party). By comparing trends across two core disciplines of artificial intelligence (AI; arguably one of the most dynamic, promising and intriguing current research fields), namely ML and NLP, we can also contrast the evolution and current state of each. The task is also timely, as interest in the analysis of scientific literature in the NLP community has been steadily on the rise recently (cf. §2).
To the best of our knowledge, we are the first to tackle stance identification on the level of abstracts which, in contrast to individual citations, is (i) more efficient, (ii) less ambiguous, and (iii) focuses on the authors' core message. We are also the first to measure the evolution of two important scientific subfields (NLP and ML) with respect to stance over time and to relate stance to important scientific success measures, i.e. citation counts and acceptance chances at venues. Our contributions: we define the new task of stance detection for scientific literature; we provide a human-annotated dataset of over 1.5 k scientific papers, labelled for their stance; we provide a large-scale trend analysis on over 41 k papers from the NLP and ML community in the past approximately 35 years; we address various trend questions including (i) whether negativity has recently increased, (ii) whether positive/negative papers are more influential, and (iii) whether positive/negative papers are more likely to be accepted.
We point out that 'negativity', which is a focus of our work, plays a central role in various contexts: in the social sciences, signed social networks are networks in which agents have positive and negative relations to each other, potentially explaining phenomena such as long-term disagreement [16,17]; in science, negative citations may (arguably) be a form of self-correction [18], and publishing negative results may reduce waste on resources for disputed approaches [19]; in economics, the principle of creative destruction [20] may explain the progress of capitalism. 3

Table 1. Example of a paper with negative stance. The underlying paper is Niven & Kao [5]. Negative elements are highlighted by us in red.
abstract (excerpt): 'We are surprised to find that BERT's peak performance of 77% on the Argument Reasoning Comprehension Task reaches just three points below the average untrained human baseline. However, we show that this result is entirely accounted for by exploitation of spurious statistical cues in the dataset. We analyse the nature of these cues and demonstrate that a range of models all exploit them. This analysis informs the construction of an adversarial dataset on which all models achieve random accuracy. [···]'

1 This concept is related to positive/negative citations within a paper, which have been annotated in a few works, e.g. Teufel et al. [11]; Athar & Teufel [12]; Abu-Jbara et al. [13]; Catalini et al. [14]; Yousif et al. [15]. Our work goes beyond individual citations and assesses the stance of the authors' main message.
2 We note that there is an asymmetry between positive and negative papers in our definition: while negative papers explicitly relate to related work, positive papers especially relate to themselves, highlighting their own positive contributions. Nonetheless, we take positive papers to implicitly accept the premises of related work, thus they have an implicit positive stance to it. We discuss more on this in §4. Calling positive papers incremental, as one reviewer suggested, would seem like a biased judgement concerning their quality.

royalsocietypublishing.org/journal/rsos R. Soc. Open Sci. 10: 221159

Related work
Historically, analysis of scientific literature is the scope of the fields of scientometrics and science-of-science. Classical results include the relation between title length and the number of citations a paper receives [21] as well as quantitative laws underlying citation patterns or the number of co-authors of papers over time [22]. Sienkiewicz & Altmann [23] relate textual properties of abstracts and titles of scientific papers to their popularity and find that the complexity of abstracts is positively correlated with citation counts, but abstract and title sentiment (measured as average valence/arousal of all words) are weakly correlated. 4 In recent years, with the rise in quality of models and approaches, more and more NLP approaches are also devoted to the analysis of scientific literature. We list several relevant studies in the following. Gao et al. [24] ask how much the author response in the 'rebuttal phase' of the peer review process influences the final scores of a reviewer, finding its impact to be marginal, especially compared with the scores of other reviewers. Pei & Jurgens [25] study (un)certainty in science communication, finding differences among journals and team size.
Prabhakaran et al. [26] predict whether a scientific topic will rise or fall in popularity based on how authors frame the topics in their work. They use a subset of the Web of Science 5 Core Collection with papers from 1991 to 2010 and analyse abstracts by assigning scientific topics (e.g. stem cell research) as well as rhetorical roles (i.e. scientific background, methods used, results etc.) to phrases. They find that topics that are currently discussed as results and background are at their peak and tend to fall in popularity in the future, whereas topics that are mentioned as methods or conclusions tend to start to rise in popularity.
Arguably the paper most similar to ours is Jurgens et al. [27]. They study the entire content of a scientific publication in order to predict its future impact, based on how citations are framed. They distinguish different functions of citations: BACKGROUND (the other work provides relevant information), USES (usage of data, methods, etc. from the other work), and COMPARISON OR CONTRAST (express similarities/differences to the other work). They analyse the evolution of the functions over (i) the course of a paper, (ii) different venues, and (iii) time. They show that NLP has seen a considerable increase in consensus when authors started to use fewer COMPARISON OR CONTRAST citations and simply acknowledged previous work as BACKGROUND. The authors argue that these trends imply that NLP has become a rapid discovery science [28], i.e. it has undergone the particular shift that occurs when a scientific field reaches a high level of consensus on its research topics, methods and technologies, and its researchers then start to continually improve on each other's methods. Our approach differs from Jurgens et al. [27] in several ways: e.g. we do not analyse individual citations, but directly evaluate the stance of a complete paper (as measured by its framing in the paper's abstract); most importantly, we are particularly interested in negative stances, a relation that is absent in the scheme of Jurgens et al. [27].
A number of more recent papers also leverage or study individual citation context, including Cohan et al. [29]; Jebari et al. [30]; Wright & Augenstein [31]; Lauscher et al. [32]. We note that annotating individual citations is more costly than annotating the abstract of a paper, as we do. Annotating individual citations is also a complex task which requires context for disambiguation (as neglected in most previous work [32]) and even then may involve looking up the cited work to understand the citation role. There may also be a bias against direct negative citations of individual works, cf. also Bordignon [18]. Sentiment analysis (e.g. polarity) for individual citations is surveyed in Yousif et al. [15]. For example, Abu-Jbara et al. [13] define a positive citation as one that explicitly states a strength of the target paper. A negative citation points to a weakness and descriptive citations are marked as neutral. We adopt a similar scheme, but apply it to the core message instead of individual citations; we also define positive papers differently, 6 i.e. they make a positive contribution in terms of proposing new techniques extending existing work (thereby implicitly accepting its premises) and setting a new state-of-the-art. Catalini et al. [14] study the role of negative citations and find (among others) that they tend to decrease citation counts of the cited paper over time. We study the dual question: whether negative papers (in our sense) receive more citations. Lamers et al. [33] study disagreement in science across diverse fields, which is a related concept to that of negative citations, finding that there is highest disagreement in the social sciences and humanities, and lowest disagreement in mathematics and computer science (of which AI is a subfield).

3 Interestingly, the publication of our paper coincides with a new hype in the AI community surrounding the release of ChatGPT, a very high-quality conversational dialogue model.
4 They use a lexicon lookup to measure valence/arousal, which cannot deal with the contextual usage of words, e.g. the distinction between 'fail' and 'not fail'. By contrast, we learn a model on human annotations to determine the stance of a paper, and our notion of stance also does not coincide with sentiment, see §6.4. Finally, we target very different scientific venues, i.e. ML/NLP versus Web of Science.
Beyond classification, Yuan et al. [34] use NLP models to automatically generate reviews for scientific papers. They conclude that their review generation model is not good enough to fully automate the reviewing process, but could still make the reviewer's job more effective. Wang et al. [35] create automatic reviews for papers by defining multiple knowledge graphs, one extracted from the paper, one from the papers the paper cites, and one for background knowledge. Beltagy et al. [36] train a language model (SciBERT; extending BERT) on a large multi-domain corpus of scientific publications.

Data
We extract our data from two sources: (i) the ACL Anthology 7 which contains papers and metadata for all major NLP events, and (ii) ML conferences.

NLP dataset
From the ACL Anthology, we extract papers from eight different NLP venues between 1984 and 2021. 8 For all venues, we only include papers from the main conference and exclude papers from workshops (by manually selecting the volumes) and contributions like book reviews and title indices (by filtering the titles). To extract the data, we download the provided metadata from the ACL Anthology website in the BibTeX format that contains information of authors, title, venue and year. We then use Allen AI's Science Parse 9 to extract abstract information and collect citation information from Semantic Scholar. 10 We refer to this dataset as NLP in the following. NLP contains more than 23 k papers in total. The distribution over the venues is shown in table 2.

ML dataset
We download papers from the respective websites of NeurIPS, 11 AAAI, 12 ICML 13 and ICLR, 14 and then use Science Parse to extract abstracts and Semantic Scholar to collect citation information as above. ML contains over 18 k papers between 1989 and 2021. The distribution over the venues is given in table 3.

Data annotation
We annotate the data from NLP and ML for each paper's stance (towards related work), as given in the authors' framing in a paper's abstract. In contrast to some related work, we do not annotate stance towards individual citations, but infer the authors' stance from the paper abstracts and titles as we are interested in the stance of the authors' overall core message. Our focus on title and abstract is a deliberate choice resting on the following observations. (i) In abstracts, authors typically condense the most important information they intend to convey. (ii) This agrees with the insight that abstracts and titles are typically the only parts of a paper that the majority of readers consume [37]. (iii) Annotating and classifying individual citations is also more costly, cognitively demanding and ambiguous, as outlined in §2. (iv) We are also not interested in whether individual citations are positive or negative but in whether the whole paper (the core message) is framed positively or negatively.

6 It would not make sense to define a paper as positive if it explicitly lauds previous work as its main message; such papers would probably be rejected as having no additional value to the field of science.
7 See https://aclanthology.org/.
8 We find by introspection that papers before 1984 often have different formatting and may even lack specifically delineated abstracts. For the same reason, we exclude papers from the CL journal before 1986.
9 See https://github.com/allenai/science-parse/. In some instances, Science Parse included other parts of the paper that could not be removed automatically. This leads to increased uncertainty especially for papers published before the year 2000.

Definition of stance
On a coarse-grained level, we consider three possible stances: (i) we define a prototypical paper of positive stance (towards related work) as one that directly or indirectly accepts its premises, builds upon and extends the existing literature and achieves a new state-of-the-art; such a paper contains phrases and sentences such as 'we present a new approach'; 'we beat the state-of-the-art'; 'we release a new dataset' (cf. appendix table 6). (ii) A prototypical paper of negative stance towards related work states that datasets, evaluation protocols or techniques are basically flawed; such a paper contains sentences of negative sentiment such as 'techniques fail'; 'approaches are limited'; 'models are unstable'. (iii) If a paper does not particularly fit into this categorization, e.g. because it discusses or summarizes previous work without criticizing it, then we consider a paper as expressing a neutral stance; survey or analysis papers typically fall into this category.
In the annotation, we relax this coarse-grained scheme and instead allow continuous numbers as stances, ranging from −1 (very negative stance) to +1 (very positive stance), with 0 as neutral stance. 15 Intuitively, the more negative/positive statements an abstract contains, the more negative/positive it is. The severity of statements also matters for the degree of positivity or negativity, e.g. 'techniques fail completely' is more negative than 'some techniques don't work properly'. Positive/neutral and negative papers are differentiated by the amount of direct criticism they express against existing work: each element of criticism (and its severity) would decrease the positivity of a paper. Neutral and positive papers are differentiated in that the latter have clearly positive elements of advancing the state-of-the-art and providing better solutions.
We provide guidelines for the annotation in the appendix as well as several examples in table 7. In the following, we describe the annotation process and provide statistics.

Annotation statistics and procedure
In total, we manually annotated 1550 papers from NLP and ML. The distribution of papers over venues is given in table 4. In this human-annotated dataset, the ACL conference, which has taken place annually since 1979, has most papers (225), followed by NeurIPS (210) and AAAI (205). In our human-annotated data, 1018 papers (65.7%) exhibit positive stance (greater than or equal to 0.1), 277 papers (17.9%) exhibit negative stance (less than or equal to −0.1) and 255 papers (16.5%) are neutral (stance in the open interval (−0.1, +0.1)). 16 We show the more detailed distribution of papers in terms of stance in figure 1.
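The thresholds above can be written as a small mapping from a continuous stance value to a coarse class. This is a minimal sketch; the function name `coarse_label` and the `threshold` parameter are our own illustrative choices, not names taken from the original work.

```python
def coarse_label(stance: float, threshold: float = 0.1) -> str:
    """Map a continuous stance in [-1, 1] to a coarse class.

    Positive: stance >= threshold; negative: stance <= -threshold;
    neutral: anything strictly in between.
    """
    if stance >= threshold:
        return "positive"
    if stance <= -threshold:
        return "negative"
    return "neutral"


# Hypothetical stance values, for illustration only.
labels = [coarse_label(s) for s in (0.8, 0.1, 0.05, -0.1, -0.6)]
```

Passing `threshold=0.2` gives the wider neutral band (−0.2, +0.2) that the appendix uses as a robustness check.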
This statistic does not reflect the true distribution of stance in our data (which is dominated by positive papers, as we will show below), as we oversampled negative papers using heuristics (e.g. looking for particular keywords in abstracts and titles such as 'fail' and 'limitation', and for titles with question marks). Candidates for positive papers were randomly drawn. We did this to ensure that classifiers are trained on datasets that are not too class-imbalanced. Manipulating the (training) data distribution is a common approach to ensure better classifiers in the face of small minority classes [38]; our heuristics-based approach also makes annotation much more efficient, as we would otherwise have to annotate considerably more data to obtain negative instances. We note that we evaluate our models also on the 'natural' data distribution, not only the skewed one; see §6.

Inter-annotator agreement
We had up to four annotators annotate abstracts for stances. The annotators were computer science undergraduate students and one computer science faculty member from NLP. Seventy-one per cent of the human-annotated dataset is annotated by one, 24% by two, 2% by three and 1% by four annotators. This distribution reflects the fact that annotating all instances by all annotators would have been too costly and, given good agreements, also not necessary. Newly incoming annotators first annotated already annotated instances (in order to measure agreement, i.e. their task understanding) and then proceeded to annotate independently. We label each abstract's stance as the average over all the annotators.
We measure agreement on stance annotation using Pearson's correlation coefficient, Cohen's kappa coefficient and Krippendorff's alpha coefficient. The resulting Pearson correlations among all pairs of annotators (on a common set of instances) range from 0.64 to 0.94 (avg.: 0.77). On a coarse level with three stances, the kappa agreement is between 0.53 and 0.87 (avg.: 0.66). The alpha agreement among all annotators is 0.74 on a ratio scale. Overall, we thus have good agreement. We illustrate inter-annotator agreement (kappa, Pearson) and the number of common instances in figure 2. In the appendix, we explore how changing the ranges for neutral, positive and negative papers affects those of our results that relate to the coarse level. In particular, we define neutral papers there as ones that fall into (−0.2, +0.2), with corresponding implications for negative and positive papers. Overall, we find that our results are very similar for such slightly different range definitions.
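Two of the agreement measures above can be sketched in a few lines. The following is a minimal standard-library illustration of Pearson's correlation on continuous stances and Cohen's kappa on coarse labels for one pair of annotators. The function names and the toy labels are assumptions made for illustration, and Krippendorff's alpha is deliberately left out.

```python
import math
from collections import Counter


def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from the two annotators' marginal label frequencies.
    expected = sum(count_a[k] * count_b[k] for k in set(count_a) | set(count_b)) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical coarse labels from two annotators on four shared abstracts.
kappa = cohen_kappa(["pos", "pos", "neg", "neg"], ["pos", "neg", "neg", "neg"])
```

In this toy case the annotators agree on three of four items (observed agreement 0.75) against a chance agreement of 0.5, so kappa is 0.5.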

Historic versus modern data
We refer to NLP and ML papers published before the year 2000 as historic papers (Hist) and papers published since 2000 as modern papers. The historic dataset consists of 246 papers, of which 194 belong to NLP and 52 to ML. Modern NLP consists of 578 and modern ML of 726 papers.

Model
For all our experiments, we use SciBERT [36]. We feed each paper to SciBERT as the concatenation of title and abstract, separated by special tokens: [CLS] <title> [SEP] <abstract>. We set the maximum token length to 300, which is sufficient for most papers and a good compromise for efficient memory usage. We add a fully connected layer with one output neuron and linear activation on top of the pooled output to obtain a single prediction for the stance of a paper. Since we define stance as a value between −1 and +1, we clip the prediction to the desired range.
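The input layout and the output head can be illustrated without a deep learning framework. The sketch below is a toy, not the actual model: `format_input` only mimics the token layout as a string (in practice a tokenizer would insert the special tokens and truncate to 300 tokens), and `predict_stance` applies a single linear neuron to a hypothetical pooled vector before clipping. All names and numbers are assumptions for illustration.

```python
def format_input(title: str, abstract: str) -> str:
    """Show the layout of the model input as a plain string."""
    return f"[CLS] {title} [SEP] {abstract}"


def predict_stance(pooled, weights, bias):
    """One output neuron with linear activation, clipped to [-1, 1].

    `pooled` stands in for the encoder's pooled output; `weights` and
    `bias` stand in for the learned parameters of the regression head.
    """
    raw = sum(w * h for w, h in zip(weights, pooled)) + bias
    return max(-1.0, min(1.0, raw))


# Toy numbers: the raw output 1.5 is clipped to the upper bound 1.0.
clipped = predict_stance([1.0, 2.0], [0.5, 0.5], 0.0)
```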
The model is fine-tuned with the following hyperparameters: a batch size of 8 or 16, a slanted triangular learning rate [39] with a maximum learning rate of 1 × 10⁻⁵, 2 × 10⁻⁵ or 5 × 10⁻⁵, a warm-up ratio of 0.06, and linear decay. We train for 2, 3 or 4 epochs and optimize using Adam [40] with an ε of 1 × 10⁻⁶, β₁ of 0.9 and β₂ of 0.999, and the mean squared error (MSE) as the loss function.
For our experiments in §6, we train 18 models by performing a full grid search over the specified hyperparameters and keep the best model based on the MSE on the dev set. We repeat this five times and calculate performance metrics (cf. §6.1) as the average score of those models.
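The count of 18 models follows from the grid: 2 batch sizes × 3 maximum learning rates × 3 epoch settings. A minimal sketch of enumerating the grid and keeping the configuration with the lowest dev MSE is shown below; `dev_mse` is a hypothetical callable that stands in for fine-tuning a model and evaluating it on the dev set, and the dictionary keys are our own naming.

```python
from itertools import product

# Hyperparameter values as listed in the text; key names are illustrative.
GRID = {
    "batch_size": [8, 16],
    "max_lr": [1e-5, 2e-5, 5e-5],
    "epochs": [2, 3, 4],
}


def configurations(grid):
    """Return every combination of the grid values as a list of dicts."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


def select_best(configs, dev_mse):
    """Keep the configuration whose (hypothetical) dev-set MSE is lowest."""
    return min(configs, key=dev_mse)


configs = configurations(GRID)  # 2 * 3 * 3 = 18 configurations
```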

Experiments
In the following, we first verify the reliability of our stance detection model described in §5. To do so, we assess its cross-domain and in-domain performance and compare it with several baselines. Once the quality of the model is assured, we apply it large-scale to determine trends over time and venues in §7.

Metrics
We use various metrics to evaluate our models. The coefficient of determination (R²) is similar to the MSE but also takes the distribution of the data into account, which makes it more informative and truthful than the MSE [41]. A model that always predicts the expected value has an R² score of 0. The range of the metric is (−∞, 1]. The macro F1 score is a standard metric to assess the quality of multiclass classification which can take class imbalance into account. We compute the macro F1 score on coarse-grained stance labels (positive, negative, neutral), see above. We also calculate the F1 score for the labels individually. The natural macro F1 score samples papers according to the natural distribution of the data (i.e. mostly positive), as predicted by our best performing model. To do this, we draw the test sets randomly (from the existing test sets) according to the true distribution 17 and then calculate the macro F1 score on the three labels. Since randomness is involved, we repeat this 1 k times and average the scores.
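For concreteness, the two main metrics can be sketched as follows. This is a minimal standard-library version written for illustration, not the evaluation code behind the reported numbers; treating a class that never occurs in gold or prediction as F1 = 0 is our assumption.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def macro_f1(true_labels, pred_labels, classes=("positive", "neutral", "negative")):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(true_labels, pred_labels))
        fp = sum(t != c and p == c for t, p in zip(true_labels, pred_labels))
        fn = sum(t == c and p != c for t, p in zip(true_labels, pred_labels))
        denom = 2 * tp + fp + fn
        # Assumption: a class absent from both gold and prediction scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example: the single neutral paper is mistaken for a positive one.
gold = ["positive", "positive", "neutral", "negative"]
pred = ["positive", "positive", "positive", "negative"]
score = macro_f1(gold, pred)
```

Here the per-class F1 scores are 0.8 (positive), 0 (neutral) and 1 (negative), so the macro F1 is 0.6, which shows how a missed minority class pulls the macro average down even when most papers are classified correctly.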

Baselines
We compare our models with simple baselines. POS: always predict a positive stance +1; ZERO: always predict a neutral stance 0; NEG: always predict a negative stance −1; AVG: always predict the average of manual annotations.
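The four constant baselines are easy to reproduce. The sketch below builds their predictions and scores them with R² on a handful of made-up gold stances; computing the AVG baseline from the same gold values it is scored on is an assumption made to keep the example short.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def baseline_predictions(gold):
    """Constant predictions for the POS, ZERO, NEG and AVG baselines."""
    n = len(gold)
    avg = sum(gold) / n
    return {
        "POS": [1.0] * n,
        "ZERO": [0.0] * n,
        "NEG": [-1.0] * n,
        "AVG": [avg] * n,
    }


# Hypothetical gold stances, for illustration only.
gold = [0.9, 0.7, 0.5, -0.3, 0.2]
scores = {name: r2_score(gold, preds) for name, preds in baseline_predictions(gold).items()}
```

On these toy values AVG scores an R² of 0 (up to floating-point error), matching the property stated in the metrics section, while the other constant baselines score below 0.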

Cross-domain experiments
Due to the exponential growth of science, our model is mostly trained on more recent data. However, we also want to make sure that we obtain reliable predictions, e.g. for past papers. As a consequence, we first evaluate our model in a cross-domain setting. In this, we train our model on papers from different time stamps or domains and evaluate on a respective out-of-domain test set. For the source data, we set the train-dev split ratio to 0.7 and 0.3, and we use the whole annotated data for the target data.

In-domain experiments
We also perform in-domain tests where we train and test on the same domain of data (e.g. modern NLP). We set the train-dev-test ratio to 0.6/0.1/0.3.

Combined experiments
We create combined train and dev sets which consist of NLP, ML and Hist papers and evaluate on each domain individually. We set the train-dev-test ratio to 0.6/0.1/0.3.
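A 0.6/0.1/0.3 split can be sketched as below. Whether the original splits were purely random or stratified by stance is not stated, so the uniform shuffle, the fixed seed and the rounding rule here are all assumptions.

```python
import random


def split(items, ratios=(0.6, 0.1, 0.3), seed=0):
    """Shuffle items and cut them into train, dev and test parts.

    Train and dev sizes are rounded from the ratios; the test part
    receives the remainder so that no item is lost.
    """
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = round(ratios[0] * len(items))
    n_dev = round(ratios[1] * len(items))
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]
    return train, dev, test


# Toy run on 100 placeholder paper ids.
train, dev, test = split(range(100))
```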

Results
Results are shown in figure 3. We observe clear trends: in-domain and cross-domain performance are typically close, but cross-domain performance is better on average (note that cross-domain uses between 1.3 and 6.2 times as much training data as in-domain; cf. table 5). The model trained on combined data, which uses even more data, outperforms in-domain and cross-domain results. In-domain and cross-domain performance on Hist is lowest, which is not surprising as this dataset is smallest in size and presumably diverges most from the modern datasets, due to temporal divergence [42]. On average, the models are best at predicting the positive class and worst at predicting the neutral class. It is interesting to note that a model trained on a combination of all sources and time periods performs best. It even leads to good results on the Hist portion of our data, which is why we use it for the analysis below.

17 We are aware that the natural distribution, as we define it, relies on our models' predictions. As these are averages over many instances, we think that our models' predictions can be relied upon at aggregate level; see our results below.

Error analysis
We further assess the quality of our best performing combined model by comparing the human-annotated and predicted stance in the test set using the confusion matrix shown in figure 4. Positive papers are correctly predicted in most cases. Neutral and negative papers are more frequently mistaken for positive ones, and when the model predicts a paper to be negative, the true class is sometimes neutral.
In appendix table 8, we illustrate sample predictions of our model, randomly sampled from a predicted stance of more than 0.8 (very positive papers) or below −0.7 (very negative papers). We note that the predicted values look very plausible overall. However, especially for the positive papers, the model misses some negative elements, thus overestimates their positivity; arguably the negative papers are also slightly more negative than the model predictions.
We further conduct a manual analysis on papers that have a large discrepancy between predicted and human stance. In these cases, we find that the model often predicts very negative papers as very positive. This may be related to positive papers being the majority class. In several cases, the model was also seemingly led astray by small positive contributions, especially in last sentences, and by superficial cues indicating positive contributions such as 'we propose'. One pattern for the misclassified papers is strong criticism followed by the release of a new dataset indicated in the last sentence. In some of the error cases, the gold standard is also incorrect or too extreme, e.g. too strongly negative.

[···] 18 a text processing library, and (iv) a valence/arousal lexicon [46]. We evaluate the similarity of stance and sentiment using Pearson's correlation coefficient, which ranges from 0.01 to 0.27. We interpret the low correlation in two ways: (i) it may be the result of different domains for the training data, which for many sentiment models are texts from social media, while we use scientific texts; (ii) it indicates that our definition of 'stance' involves different nuances than simple 'sentiment'. For example, text such as 'we propose a new model' indicates positivity in our context, but may be considered neutral sentiment. Similarly, 'we identify limitations in existing works' may be considered neutral sentiment polarity, but indicates negativity in our context.

Analysis
We analyse large-scale trends from the combined model's predictions and smooth the graphs with Gaussian blurs. 19 We further use Welch's t-test [47] and the Kruskal-Wallis H-test [48] to detect differences in the distributions and to test our hypotheses; we report the achieved significance levels in parentheses. Our first questions connect to the recent paper of Bowman [10], who observes 'a wave of surprising negative results' in recent years in the NLP community; our large-scale predictions partly confirm his evidence from selected case studies (but partly also put it in perspective).

'Are there more positive or more negative papers?'
The histogram of stance values predicted by the model, aggregated over all venues and years, is visualized in figure 5a. It shows that most papers have a positive stance and that the more negative the stance gets, the fewer papers there are. Fewer than 4% of all papers have a negative stance and more than three quarters of all papers have a stance of greater than or equal to 0.6.

'Are NLP papers more positive/negative than ML papers?'
Both hypothesis tests, the t-test and the H-test, show with a significance level of 0.01% that the distribution of predicted stance values differs between NLP and ML. Figure 5b shows the histogram of the predicted stance values, aggregated over all years, for both datasets. The distribution is similar to the overall trend, but ML has more papers with stance values between 0.5 and 0.8, whereas NLP has more papers with a stance of 1.0. Overall, 3.9% of all NLP papers and 2.3% of all ML papers exhibit a negative stance, which makes NLP more negative than ML.

18 See https://textblob.readthedocs.io/.
19 The observed trends correlate between our model and a model trained as a reference on the full dataset with 1.5 k papers (no test split) with a Pearson's correlation coefficient from 0.76 to 0.99 (avg.: 0.96).
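Welch's t-test, used for the comparisons in this analysis, differs from the classical t-test in that it does not assume equal variances in the two groups. The sketch below computes only the test statistic and the Welch-Satterthwaite degrees of freedom; turning these into a p-value requires the Student t distribution from a statistics library and is omitted here. The sample values are made up.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Return Welch's t statistic and approximate degrees of freedom."""
    # Squared standard errors of the two sample means (sample variance / n).
    se2_a = variance(a) / len(a)
    se2_b = variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    return t, df


# Hypothetical samples with clearly different variances.
t_stat, dof = welch_t([1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 4.0, 6.0, 8.0, 10.0])
```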

'Did AI get more positive/negative recently?'
We analyse the development of the average stance value over time in figure 6. This shows that the average stance is always positive, with a minimum value of 0.34 for NLP in the 1980s. When our ML dataset started in 1989, it was more positive than NLP (p-value 0.01%, t-test). In the late 1990s, the stance of papers in NLP and ML came closer together, and in the 2000s (when NAACL and SemEval were first held) NLP overtook ML and became slightly more positive (p-value 0.01%, t-test). Positiveness reached its peak around 2015, with an average stance above 0.80 for NLP. The average stance for NLP then started to decrease, which means that the field of NLP has become less positive recently. The ML domain, however, got more positive, with a maximum stance of 0.81 in 2021.
Overall, the average stance of both ML and NLP increased substantially from the mid-1980s to the 2020s. This means that papers got more positive on average, i.e. they build upon one another and report better and better results (i.e. new state-of-the-art performances), confirming the observation that ML and NLP have become 'rapid discovery sciences' [27].
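The smoothing applied to such yearly trend curves can be sketched as a one-dimensional Gaussian-weighted moving average. The kernel width and the edge handling are not specified in the text, so `sigma`, the three-sigma cut-off and the renormalization at the boundaries below are all assumptions.

```python
import math


def gaussian_smooth(values, sigma=1.0):
    """Smooth a sequence with a Gaussian kernel truncated at 3 * sigma.

    Near the ends only the available neighbours are used and their
    weights are renormalized, so the output has the same length.
    """
    radius = max(1, math.ceil(3 * sigma))
    smoothed = []
    for i in range(len(values)):
        weighted_sum = 0.0
        weight_total = 0.0
        for j in range(max(0, i - radius), min(len(values), i + radius + 1)):
            w = math.exp(-0.5 * ((i - j) / sigma) ** 2)
            weighted_sum += w * values[j]
            weight_total += w
        smoothed.append(weighted_sum / weight_total)
    return smoothed


# A single-year spike is spread over neighbouring years.
curve = gaussian_smooth([0.0, 0.0, 1.0, 0.0, 0.0])
```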
We further analyse whether the increase in positiveness from 1990 to 2010 and the decrease in positiveness in the most recent years for NLP come from more positive/negative papers overall or from positive/negative papers getting less/more positive/negative. Figure 7 shows how many papers have a negative stance in each year. We observe that negativity peaked in the 1980s and 1990s for NLP and ML, respectively. There was then a continuous downward trend in negativity until the 2010s. In recent years, negativity has increased for both domains, but considerably more sharply for NLP, from 2% of all papers in 2015 to 4%.
Figure 8 shows the average stance value of all positive papers and the average stance value of all negative papers over time. The development of the average positive stance value is very similar to the development of the average stance value overall. By contrast, the average negative stance value has a decreasing trend for both datasets, which means that negative papers have become more negative over time.
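The two quantities discussed here, the yearly share of negative papers and the yearly mean stance of negative papers, can be computed as in the sketch below. It assumes the threshold for negative papers from the range definitions in the appendix (stance less than or equal to −0.1); the function name and the input format of (year, stance) pairs are hypothetical.

```python
from collections import defaultdict

NEGATIVE_MAX = -0.1  # a paper counts as negative if its stance is <= this value


def yearly_negativity(papers):
    """Map year -> (share of negative papers, mean stance of negative papers or None)."""
    by_year = defaultdict(list)
    for year, stance in papers:
        by_year[year].append(stance)
    result = {}
    for year, stances in sorted(by_year.items()):
        negatives = [s for s in stances if s <= NEGATIVE_MAX]
        share = len(negatives) / len(stances)
        # The mean is undefined for years without any negative paper.
        mean_negative = sum(negatives) / len(negatives) if negatives else None
        result[year] = (share, mean_negative)
    return result
```

Keeping the two values separate distinguishes 'more negative papers' (a rising share) from 'negative papers getting more negative' (a falling mean).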
Finally, we analyse trend curves for individual venues, visualized in figure 9. The trend towards more negative papers in the most recent years is visible for most venues, especially ACL, EMNLP, COLING, NAACL, CoNLL and ICML. TACL has the sharpest increase. SemEval and AAAI do not follow this trend, however. Many venues were more negative before the 2000s and least negative in the 2000s. The CL journal is a noticeable outlier: it is the most negative venue with up to 16% negative papers (p-value 0.01%, t-test); we note that it is also the only journal in our dataset besides TACL.

7.4. 'Do positive/negative papers receive more/fewer citations?'
Figure 10 shows how many citations a paper with a certain stance value has received in comparison with papers published in the same year. We compare citation counts using normalized values that indicate how many citations more or less a paper has received in comparison with the average number of citations a paper published in the same year has received, measured in multiples of the standard deviation of citation counts in that year. Positive values indicate more citations than the average, negative values indicate fewer citations. The graph shows that papers with a negative stance of −0.3 or less receive more citations than the average paper in the same year (p-value 3%, t-test); very negative papers receive even more citations. By contrast, a paper with a positive stance receives fewer citations on average (p-value 6%, t-test), but very positive papers with a stance value of more than 0.8 receive slightly more citations (p-value 3%, t-test). Similar results can be found by analysing the individual domains, NLP and ML, separately, which is shown in figure 11.
The domain of ML is more extreme than NLP in that negative papers with a stance of −0.5 or less receive even more citations (p-value 6%, t-test) and positive papers with stance values from 0.1 to 0.7 even fewer (p-value 0.1%, t-test).
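The citation normalization described in this section amounts to a per-year z-score. The sketch below illustrates it under two assumptions that the text does not settle: the population standard deviation is used, and years in which all papers have the same citation count are scored as 0. The function name and input format are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def normalize_citations(papers):
    """Return one z-score per (year, citations) pair: (citations - year mean) / year sd."""
    by_year = defaultdict(list)
    for year, citations in papers:
        by_year[year].append(citations)
    stats = {year: (mean(c), pstdev(c)) for year, c in by_year.items()}
    scores = []
    for year, citations in papers:
        mu, sigma = stats[year]
        # A year without any spread in citation counts is scored as 0.
        scores.append((citations - mu) / sigma if sigma > 0 else 0.0)
    return scores
```

Normalizing within each year removes the advantage that older papers have simply from having had more time to accumulate citations.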
Overall, this shows that papers of negative stance seem to attract more citations than papers of neutral or positive stance, indicating that they receive more attention and have a larger effect on the community. Together with Catalini et al. [14], this means that a paper of negative stance receives more citations but decreases the citation counts of papers that it cites negatively, which may indicate that it shifts attention from those papers to itself. The fact that papers of negative stance receive more citations would also make stance suitable for inclusion as a feature in models that predict citation count [49].
However, we acknowledge that other factors may also be at work here, e.g. stance may merely be correlated with another variable that drives citations. For example, it could be that high-prestige authors (e.g. measured in terms of their h-indices) tend to be over-represented in papers of negative stance; high-quality authors, in turn, may attract more citations (potentially because they write better papers), which could explain the link between citations and stance.
To analyse the relationship between stance and citations in more depth, we performed a linear regression similar to Sienkiewicz & Altmann [23] by measuring various factors (see (a-d) below), including the length and complexity of titles and abstracts, as well as a paper's stance, to predict its citation count. We could not reproduce their results because the goodness-of-fit of the regression was low, indicating that linear models are not an adequate choice in our case. Instead, we analyse which factors differ between very positive (greater than or equal to 0.8), very negative (less than or equal to −0.8) and neutral papers (∈ (−0.1, +0.1)) on a subset of 39 k papers for which we have all required metadata, which leads to 24.1 k very positive, 69 very negative and 874 neutral papers.
Similarly to Sienkiewicz & Altmann [23], we explore the following factors: (a) length (characters in title, and words in abstract), (b) complexity (Herdan's C index [50], z-score [51] and Gunning fog index [52]), (c) sentiment (average valence/arousal [46]) and (d) author information (number of authors, mean/minimum/maximum h-index of authors 20) for papers of positive, negative, and neutral stance. We standardize all factors per venue and show the mean of each factor together with the 95% confidence interval in figure 12. While the length of the title does not reveal a clear trend, longer abstracts are more common in neutral papers. The variance for complexity factors is generally too high to draw conclusions, but titles with a high z-score, i.e. high complexity, tend to be more common in very negative papers. Very negative papers have higher arousal and neutral papers have the lowest arousal. Author information separates positive, negative and neutral papers best. In contrast to our initial hypothesis expressed above, very negative papers have authors with lower (current) h-indices.
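As an illustration of standardizing a factor per venue and then summarizing it per stance group, consider the sketch below. It rests on assumptions: z-scores use the population standard deviation within each venue, and the 95% confidence interval uses the normal approximation (1.96 standard errors), which may differ from how figure 12 was actually produced. All names are hypothetical.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, pstdev, stdev


def standardize_per_venue(rows):
    """rows: (venue, group, value) triples. Returns group -> list of per-venue z-scores."""
    by_venue = defaultdict(list)
    for venue, _, value in rows:
        by_venue[venue].append(value)
    stats = {v: (mean(xs), pstdev(xs)) for v, xs in by_venue.items()}
    by_group = defaultdict(list)
    for venue, group, value in rows:
        mu, sigma = stats[venue]
        by_group[group].append((value - mu) / sigma if sigma > 0 else 0.0)
    return dict(by_group)


def mean_with_ci(values, z=1.96):
    """Mean and half-width of a normal-approximation 95% confidence interval."""
    half_width = z * stdev(values) / sqrt(len(values))
    return mean(values), half_width
```

Standardizing within each venue first keeps venue-specific conventions, such as typical abstract length, from dominating the comparison between stance groups.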

7.5. 'Do positive/negative papers have lower/higher acceptance chances?'
We use 5453 accepted and 14 974 rejected papers from the ICLR conferences in 2013 and 2017-2021, collected from OpenReview, 21 and from ACL, EMNLP, NAACL, NeurIPS, AAAI, ICML and ICLR over the years 2007-2017 as collected by Kang et al. [53]. The t-test and the H-test show with a significance level of 0.01% that the distribution of predicted stance values differs between accepted and rejected papers. Figure 13 shows how many papers with a certain stance value were accepted. The trend indicates that papers with a negative stance of −0.6 or less have higher acceptance rates (p-value 1%, t-test). The acceptance rate for papers with stance values between −0.6 and 0.8 is lower
Figure 11. Normalized number of citations a paper with a certain stance and domain has received, average number of normalized citations and 95% confidence interval.
20 We use h-indices of a paper's authors we collected from Semantic Scholar in late 2021, i.e. these are not historic h-indices from the time of publication of a paper.
21 See https://openreview.net/.
We also calculate the acceptance rates for two separate time spans, 2007-2014 and 2015-2021, as shown in figure 14. The t-test and the H-test show that the acceptance chances are different in the two time spans with a significance level of 0.01%. For the most recent years 2015-2021, the trend is similar to the overall trend, including that papers with a very positive stance of more than 0.8 are more likely to be accepted (p-value 0.1%, t-test), but those papers do not have much higher chances. However, the acceptance rates in earlier years 2007-2014 were different. Papers with a very positive stance of 0.8 or more used to have better acceptance chances than very positive papers in 2015-2021 (p-value 0.1%, t-test). This is consistent with the trend over time, which shows that fewer papers were negative in the years 2007-2014 than in 2015-2021, implying a bias: positive papers were more popular back then and therefore more positive papers were more likely to be accepted.
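The normalized acceptance rate, i.e. the difference in percentage points between the acceptance rate in a stance range and the average acceptance rate, can be sketched as follows. The bin edges are passed in by the caller and are hypothetical here; the half-open binning, with the last bin also including its upper edge, is likewise an assumption of this sketch.

```python
def acceptance_by_bin(submissions, edges):
    """submissions: (stance, accepted) pairs; edges: increasing bin boundaries.

    Returns the overall acceptance rate and, per bin, the bin's acceptance rate
    minus the overall rate in percentage points (None for empty bins).
    """
    overall = sum(accepted for _, accepted in submissions) / len(submissions)
    deltas = []
    for low, high in zip(edges, edges[1:]):
        # Half-open bins [low, high); the last bin also includes its upper edge.
        in_bin = [acc for s, acc in submissions
                  if low <= s < high or (high == edges[-1] and s == high)]
        rate = sum(in_bin) / len(in_bin) if in_bin else None
        deltas.append(None if rate is None else 100 * (rate - overall))
    return overall, deltas
```

Applying the function separately to each time span keeps the two spans comparable even when their average acceptance rates differ.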

Concluding remarks
We analysed stance in abstracts of scientific publications, where authors position themselves positively or negatively (with respect to related work). We annotated over 1.5 k abstracts from ML and NLP venues and trained a SciBERT model on a subset of the annotated abstracts, verifying that the model is of sufficiently high quality for the task. We then used this model to automatically predict the stance of a paper based on its title and abstract. We applied the model to a collection of 41 k scientific publications in the domain of NLP and ML from the years 1984 to 2021 to enable large-scale analysis.
Figure 13. Acceptance rate of a submitted paper with a certain stance value, average acceptance rate and 95% confidence interval. The dotted line indicates the overall acceptance rate of 26.7%.
The analysis revealed that the majority of papers in the past and today have a positive stance, that the average stance has substantially increased over time, yielding support for the hypothesis that ML and NLP have become 'rapid discovery sciences', and that the ML domain is more positive than the NLP domain. Scientific publications used to have a more negative stance in the early days, then became very positive until they started to get more negative again recently. Overall, publications also got more extreme over time, which means that positive papers became more positive and papers with a negative stance more negative. We found (very) negative papers to be more influential than (mildly) positive ones in terms of citations they receive and more likely to be accepted to NLP/ML venues.
We believe that NLP/ML turned more positive when the fields became more statistically solid, starting from the 1990s, and when authors started to build on the existing literature, with a peak of positivity in the mid-2010s, the beginning of the deep learning revolution. This hype has apparently also led to a recent increase in negativity, when papers started to challenge the validity of some of the claims [54], when issues of adversarial robustness [55] and reproducibility [56,57] became apparent and people began to question evaluation frameworks [58].
Our results also inform the recent work of Bowman [10], who warns of the dangers of (what he calls) 'underclaiming papers' (which are negative papers in our terminology), by providing quantitative measures of negative papers over time. Given that negative papers tend to receive more attention (in terms of citations), we point out that they may also be key factors in improving the status quo (cf. also [14]), which highlights their positive contribution to the scientific process. We also note that while the number of negative papers has indeed increased sharply in recent years, at least in the NLP domain, it is still at a relatively low level from a historical perspective.
Future work should address other scientific disciplines beyond NLP and ML for a broader scientific trend analysis, examine the correlation between the overall stance of a paper and individual (negative) citations in its related work sections, 22 annotate word-level rationales for our sentence-level scores, assess the correlation between stance and socio-demographic factors (gender, nationality, affiliation, h-index etc.) and analyse how negative papers may potentially transform a field.
We release our data, code and model on GitHub. 23
Data accessibility. Data
Figure 14. Normalized acceptance rate of a submitted paper with a certain stance value for two time spans, average acceptance rate and 95% confidence interval. Normalized values indicate how many percentage points more or less a paper with a certain stance value is likely to be accepted in comparison with the average acceptance rate in each time span. The dotted line indicates the average acceptance rate.
22 We conducted a small-scale experiment on 20 papers to test whether a negative paper actually addresses previous work in a negative way, by comparing a paper's stance, predicted using title and abstract, with the stance of the citations given in the related work or background section of the paper. To do this, we randomly selected 10 very positive and 10 very negative papers based on our model's predictions and manually annotated the citations (using citation context as in Lauscher et al. [32]). In the sample, we find that a paper's stance and the difference between the number of positive and negative citations are correlated with a Pearson's correlation coefficient of 0.62. This means that the stance of a paper may indeed be reflected in the way the authors frame citations in their related work.
23 See https://github.com/DominikBeese/DidAIGetMoreNegativeRecently.
Table 6. Top n-grams in the abstract of positive, neutral or negative papers in comparison with abstracts of non-positive, non-neutral or non-negative papers, ranked according to the log-likelihood ratio [59] using the tool from Gao et al. [24]. We use the 1.5 k papers from our human-annotated dataset. EOS denotes the end of sentence.
positive n-grams    neutral n-grams              negative n-grams
we propose          computational linguistics    we find
propose a           of the                       that these
we propose a        the language                 and that
our approach        to be                        find that

'Syntax-based statistical machine translation (MT) aims at applying statistical models to structured data. In this paper, we present a syntax-based statistical machine translation system based on a probabilistic synchronous dependency insertion grammar. Synchronous dependency insertion grammars are a version of synchronous grammars defined on dependency trees. We first introduce our approach to inducing such a grammar from parallel corpora. Second, we describe the graphical model for the machine translation task, which can also be viewed as a stochastic tree-to-tree transducer. We introduce a polynomial time decoding algorithm for the model. We evaluate the outputs of our MT system using the NIST and Bleu automatic MT evaluation software. The result shows that our system outperforms the baseline system based on the IBM models in both translation speed and quality.' [60]
NLP (1.0) 'Product classification is the task of automatically predicting a taxonomy path for a product in a predefined taxonomy hierarchy given a textual product description or title. For efficient product classification we require a suitable representation for a document (the textual description of a product) feature vector and efficient and fast algorithms for prediction. To address the above challenges, we propose a new distributional semantics representation for document vector formation. We also develop a new two-level ensemble approach utilising (with respect to the taxonomy tree) path-wise, node-wise and depth-wise classifiers to reduce error in the final product classification task. Our experiments show the effectiveness of the distributional representation and the ensemble approach on data sets from a leading e-commerce platform and achieve improved results on various evaluation metrics compared to earlier approaches.' [61]
'A large amount of recent research has focused on tasks that combine language and vision, resulting in a proliferation of datasets and methods. One such task is action recognition, whose applications include image annotation, scene understanding and image retrieval. In this survey, we categorize the existing approaches based on how they conceptualize this problem and provide a detailed review of existing datasets, highlighting their diversity as well as advantages and disadvantages. We focus on recently developed datasets which link visual information with linguistic resources and provide a fine-grained syntactic and semantic analysis of actions in images.' [

A.1. Annotation guidelines
We issued the following guidelines to annotators: (i) the stance is a value in the range from −1 (very negative) to +1 (very positive), and (ii) only the title and abstract of a paper are taken into account. A contribution has a positive stance and is annotated with a positive number up to +1 when: (i) it clearly indicates that it improves the state-of-the-art by beating existing standards; (ii) it presents novel techniques; (iii) it proposes solutions to problems of previous work; (iv) it gives insights into existing models or methods and explains why they work.
A contribution has a negative stance and is annotated with a negative number down to −1 when: (i) it clearly criticizes previous work for being wrong; (ii) it presents flaws of existing work, i.e. that an approach is deficient with respect to some property; (iii) it analyses errors of other methods and explains why they do not work as expected.
tests to determine whether differences in performance are likely to arise by chance, and few examine the stability of system ranking across multiple training-testing splits. We conduct replication and reproduction experiments with nine part-of-speech taggers published between 2000 and 2018, each of which claimed state-of-the-art performance on a widely-used "standard split". While we replicate results on the standard split, we fail to reliably reproduce some rankings when we repeat this analysis with randomly generated training-testing splits. We argue that randomly generated splits should be used in system evaluation.' [68]
NLP (−1.0) 'The cognitive mechanisms needed to account for the English past tense have long been a subject of debate in linguistics and cognitive science. Neural network models were proposed early on, but were shown to have clear flaws. Recently, however, Kirov and Cotterell (2018) showed that modern encoder-decoder (ED) models overcome many of these flaws. They also presented evidence that ED models demonstrate humanlike performance in a nonce-word task. Here, we look more closely at the behaviour of their model in this task. We find that (1) the model exhibits instability across multiple simulations in terms of its correlation with human data, and (2) even when results are aggregated across simulations (treating each simulation as an individual human participant), the fit to the human data is not strong-worse than an older rule-based model. These findings hold up through several alternative training regimes and evaluation measures. Although other neural architectures might do better, we conclude that there is still insufficient evidence to claim that neural nets are a good cognitive model for this task.' [69]
Table 8. Predicted highly positive and highly negative papers from our datasets. Blue/red text represents positive/negative rationales; these are added by us.
stance    abstract
NLP (0.93) 'Attention-based neural models were employed to detect the different aspects and sentiment polarities of the same target in targeted aspect-based sentiment analysis (TABSA). However, existing methods do not specifically pre-train reasonable embeddings for targets and aspects in TABSA. This may result in targets or aspects having the same vector representations in different contexts and losing the context-dependent information. To address this problem, we propose a novel method to refine the embeddings of targets and aspects. Such pivotal embedding refinement utilizes a sparse coefficient vector to adjust the embeddings of target and aspect from the context. Hence the embeddings of targets and aspects can be refined from the highly correlative words instead of using context-independent or randomly initialized vectors. Experiment results on two benchmark datasets show that our approach yields the state-of-the-art performance in TABSA task.'
coreference in corpora for such systems. This question highlights the tension that sometimes appears in the development of corpora between linguistic considerations and the aim for perfection on the one hand and practical applications and the aim for efficiency on the other. Many current projects that seek to identify coreferential links automatically, assume an annotation strategy which instructs the annotator to mark a predicative NP as coreferential with its subject if it is part of a positive sentence. This paper argues that such a representation is not linguistically plausible, and that it will fail to generate an optimal result.' [72]
ML (−0.94) 'Careful tuning of the learning rate, or even schedules thereof, can be crucial to effective neural net training. There has been much recent interest in gradient-based meta-optimization, where one tunes hyperparameters, or even learns an optimizer, in order to minimize the expected loss when the training procedure is unrolled. But because the training procedure must be unrolled thousands of times, the meta-objective must be defined with an orders-of-magnitude shorter time horizon than is typical for neural net training. We show that such short-horizon meta-objectives cause a serious bias towards small step sizes, an effect we term short-horizon bias. We introduce a toy problem, a noisy quadratic cost function, on which we analyze short-horizon bias by deriving and comparing the optimal schedules for short and long time horizons. We then run meta-optimization experiments (both offline and online) on standard benchmark datasets, showing that meta-optimization chooses too small a learning rate by multiple orders of magnitude, even when run with a moderately long time horizon (100 steps) typical of work in the area. We believe short-horizon bias is a fundamental problem that needs to be addressed if meta-optimization is to scale to practical neural net training regimes.' [73]

Contributions that have positive and negative parts are annotated with a value between −1 and +1, taking into account the following: (i) the importance of individual parts matters, i.e. 'some problems' is less negative than 'fails to work'; (ii) the amount of positive and negative parts matters; (iii) the last sentence of an abstract is usually the most important sentence. If the last sentence of a contribution is positive, it is more positive than a contribution with a negative last sentence.
Contributions that fall outside this labelling scheme are neutral and annotated with a 0. Those include: (i) contributions that explore existing work without beating other systems or explaining why it works or does not work; (ii) contributions that compare, discuss, study or summarize existing work without criticizing it.

A.2. Range definitions
We show that our trends are similar with different thresholds for positive (greater than or equal to 0.1), negative (less than or equal to −0.1) and neutral (∈ (−0.1, +0.1)) papers by recreating figures 7 and 9 with alternative thresholds for positive (greater than or equal to 0.2), negative (less than or equal to −0.2) and neutral (∈ (−0.2, +0.2)) papers, cf. figures 15 and 16.
Figure 16. Percentage of negative papers, calculated as the number of negative papers divided by the total number of papers in each year, for each venue on a logarithmic scale with alternative thresholds.
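The effect of the two threshold choices can be illustrated with a small sketch that labels stance values and compares the share of negative papers under each threshold. The function names and the sample values are hypothetical and serve only to show the mechanics of the range definitions.

```python
def label(stance, threshold=0.1):
    """Classify a stance value as 'positive', 'negative' or 'neutral'."""
    if stance >= threshold:
        return "positive"
    if stance <= -threshold:
        return "negative"
    return "neutral"


def negative_share(stances, threshold):
    """Fraction of papers labelled negative under the given threshold."""
    return sum(label(s, threshold) == "negative" for s in stances) / len(stances)


# Hypothetical stance values, not taken from the paper's data.
stances = [0.9, 0.15, -0.15, -0.5, 0.0]
share_default = negative_share(stances, 0.1)      # counts -0.15 and -0.5
share_alternative = negative_share(stances, 0.2)  # counts only -0.5
```

Raising the threshold can only move papers from the positive or negative class into the neutral class, so the negative share never increases.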